{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "b7e2255151321350",
   "metadata": {},
   "source": [
    "# Autoflow\n",
    "\n",
    "Autoflow is a RAG framework supported:\n",
    "\n",
    "- Vector Search Based RAG\n",
    "- Knowledge Graph Based RAG (aka. GraphRAG)\n",
    "- Knowledge Base and Document Management"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f4c3f49f",
   "metadata": {},
   "source": [
    "## Installation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "7bbed79850462cfe",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-04-15T01:31:23.724019Z",
     "start_time": "2025-04-15T01:31:22.872381Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Note: you may need to restart the kernel to use updated packages.\n"
     ]
    }
   ],
   "source": [
    "%pip install -q autoflow-ai==0.0.2.dev5 ipywidgets"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0b6d5be6",
   "metadata": {},
   "source": [
    "## Prerequisites\n",
    "\n",
    "- Go [tidbcloud.com](https://tidbcloud.com/) or using [tiup playground](https://docs.pingcap.com/tidb/stable/tiup-playground/) to create a free TiDB database cluster.\n",
    "- Go [OpenAI platform](https://platform.openai.com/api-keys) to create your API key."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "66ea056f213efcae",
   "metadata": {},
   "source": [
    "#### For Jupyter Notebook\n",
    "\n",
    "Configuration can be provided through environment variables, or using `.env`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "7de9ab2c65f1880e",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-04-15T01:31:23.740325Z",
     "start_time": "2025-04-15T01:31:23.729076Z"
    }
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "# Check if the .env file is existing.\n",
    "if [ -f .env ]; then\n",
    "    exit 0\n",
    "fi\n",
    "\n",
    "# Create .env file with your configuration.\n",
    "cat > .env <<EOF\n",
    "TIDB_HOST=localhost\n",
    "TIDB_PORT=4000\n",
    "TIDB_USERNAME=root\n",
    "TIDB_PASSWORD=\n",
    "TIDB_DATABASE=test\n",
    "OPENAI_API_KEY='your_openai_api_key'\n",
    "EOF"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "b6cdb4d5",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-04-15T01:31:23.826376Z",
     "start_time": "2025-04-15T01:31:23.820064Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import os\n",
    "import dotenv\n",
    "\n",
    "dotenv.load_dotenv()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "6ddd696d4e1d9c78",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-04-15T01:31:24.238783Z",
     "start_time": "2025-04-15T01:31:23.836248Z"
    }
   },
   "outputs": [],
   "source": [
    "from pandas import DataFrame\n",
    "from pandas import set_option\n",
    "\n",
    "set_option(\"display.max_colwidth\", None)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f8897854c897bf17",
   "metadata": {},
   "source": [
    "## Quickstart"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a38fde21",
   "metadata": {},
   "source": [
    "### Init Autoflow"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "84f43a00",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-04-15T01:31:26.467866Z",
     "start_time": "2025-04-15T01:31:24.243001Z"
    }
   },
   "outputs": [],
   "source": [
    "from autoflow import Autoflow\n",
    "from autoflow.configs.db import DatabaseConfig\n",
    "from autoflow.configs.main import Config\n",
    "\n",
    "af = Autoflow.from_config(\n",
    "    config=Config(\n",
    "        db=DatabaseConfig(\n",
    "            host=os.getenv(\"TIDB_HOST\"),\n",
    "            port=int(os.getenv(\"TIDB_PORT\")),\n",
    "            username=os.getenv(\"TIDB_USERNAME\"),\n",
    "            password=os.getenv(\"TIDB_PASSWORD\"),\n",
    "            database=os.getenv(\"TIDB_DATABASE\"),\n",
    "            enable_ssl=False,\n",
    "        )\n",
    "    )\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5afea9b7",
   "metadata": {},
   "source": [
    "### Create knowledge base"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "9e1ff63c",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-04-15T01:31:26.512535Z",
     "start_time": "2025-04-15T01:31:26.475394Z"
    }
   },
   "outputs": [
    {
     "data": {
      "application/json": {
       "class_name": "KnowledgeBase",
       "description": "This is a knowledge base for testing",
       "index_methods": [
        "VECTOR_SEARCH",
        "KNOWLEDGE_GRAPH"
       ],
       "name": "New KB",
       "namespace": "quickstart"
      },
      "text/plain": [
       "<IPython.core.display.JSON object>"
      ]
     },
     "execution_count": 6,
     "metadata": {
      "application/json": {
       "expanded": false,
       "root": "root"
      }
     },
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from autoflow.configs.knowledge_base import IndexMethod\n",
    "from autoflow.models.llms import LLM\n",
    "from autoflow.models.embedding_models import EmbeddingModel\n",
    "from IPython.display import JSON\n",
    "\n",
    "llm = LLM(\"gpt-4o-mini\")\n",
    "embed_model = EmbeddingModel(\"text-embedding-3-small\")\n",
    "\n",
    "kb = af.create_knowledge_base(\n",
    "    namespace=\"quickstart\",\n",
    "    name=\"New KB\",\n",
    "    description=\"This is a knowledge base for testing\",\n",
    "    index_methods=[IndexMethod.VECTOR_SEARCH, IndexMethod.KNOWLEDGE_GRAPH],\n",
    "    llm=llm,\n",
    "    embedding_model=embed_model,\n",
    ")\n",
    "JSON(kb.model_dump())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "7c217f7f8cf956d8",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-04-15T01:31:26.736798Z",
     "start_time": "2025-04-15T01:31:26.516452Z"
    }
   },
   "outputs": [],
   "source": [
    "# Reset all the data of knowledge base.\n",
    "kb.reset()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d4ac8a82485d4232",
   "metadata": {},
   "source": [
    "### Custom Chunker"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "cddfe61c16ee934e",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-04-15T01:31:26.744529Z",
     "start_time": "2025-04-15T01:31:26.740821Z"
    }
   },
   "outputs": [],
   "source": [
    "from autoflow.chunkers.text import TextChunker\n",
    "from autoflow.configs.chunkers.text import TextChunkerConfig\n",
    "\n",
    "text_chunker = TextChunker(config=TextChunkerConfig(chunk_size=256, chunk_overlap=20))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4cfc2d80",
   "metadata": {},
   "source": [
    "### Import documents from files"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "f729326f",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-04-15T01:31:57.520138Z",
     "start_time": "2025-04-15T01:31:26.749953Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>text</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0196384b-e01b-7e87-8ae3-ceaebc5ea4f0</td>\n",
       "      <td>---\\ntitle: What is TiDB Self-Managed\\nsummary: Learn about the key features and usage scenarios of TiDB.\\naliases: ['/docs/dev/key-features/','/tidb/dev/key-features','/docs/dev/overview/']\\n---\\n\\n# What is TiDB Self-Managed\\n\\n&lt;!-- Localization note for TiDB:\\n\\n- English: use distributed SQL, and start to emphasize HTAP\\n- Chinese: can keep \"NewSQL\" and emphasize one-stop real-time HTAP (\"一栈式实时 HTAP\")\\n- Japanese: use NewSQL because it is well-recognized\\n\\n--&gt;\\n\\n[TiDB](https://github.com/pingcap/tidb) (/'taɪdiːbi:/, \"Ti\" stands for Titanium) is an open-source distributed SQL database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads. It is MySQL compatible and features horizontal scalability, strong consistency, and high availability. The goal of TiDB is to provide users with a one-stop database solution that covers OLTP (Online Transactional Processing), OLAP (Online Analytical Processing), and HTAP services. TiDB is suitable for various use cases that require high availability and strong consistency with large-scale data.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0196384b-e01b-7ebc-9a82-f51dac13ba5c</td>\n",
       "      <td>TiDB is suitable for various use cases that require high availability and strong consistency with large-scale data.\\n\\nTiDB Self-Managed is a product option of TiDB, where users or organizations can deploy and manage TiDB on their own infrastructure with complete flexibility. With TiDB Self-Managed, you can enjoy the power of open source, distributed SQL while retaining full control over your environment.\\n\\nThe following video introduces key features of TiDB.\\n\\n&lt;iframe width=\"600\" height=\"450\" src=\"https://www.youtube.com/embed/aWBNNPm21zg?enablejsapi=1\" title=\"Why TiDB?\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture\" allowfullscreen&gt;&lt;/iframe&gt;\\n\\n## Key features\\n\\n- **Easy horizontal scaling**\\n\\n  The TiDB architecture design separates computing from storage, letting you scale out or scale in the computing or storage capacity online as needed. The scaling process is transparent to application operations and maintenance staff.\\n\\n- **Financial-grade high availability**\\n\\n  Data is stored in multiple replicas, and the Multi-Raft protocol is used to obtain the transaction log.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0196384b-e01b-7ed4-bf69-f65c861aedf7</td>\n",
       "      <td>A transaction can only be committed when data has been successfully written into the majority of replicas. This guarantees strong consistency and availability when a minority of replicas go down. You can configure the geographic location and number of replicas as needed to meet different disaster tolerance levels.\\n\\n- **Real-time HTAP**\\n\\n  TiDB provides two storage engines: [TiKV](/tikv-overview.md), a row-based storage engine, and [TiFlash](/tiflash/tiflash-overview.md), a columnar storage engine. \\n\\n  TiFlash uses the Multi-Raft Learner protocol to replicate data from TiKV in real time, ensuring consistent data between the TiKV row-based storage engine and the TiFlash columnar storage engine. TiKV and TiFlash can be deployed on different machines as needed to solve the problem of HTAP resource isolation.\\n\\n- **Cloud-native distributed database**\\n\\n  TiDB is a distributed database designed for the cloud, providing flexible scalability, reliability, and security on the cloud platform. Users can elastically scale TiDB to meet the requirements of their changing workloads.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0196384b-e01b-7ee1-91f0-d57434e5d74d</td>\n",
       "      <td>Users can elastically scale TiDB to meet the requirements of their changing workloads. In TiDB, each piece of data has at least 3 replicas, which can be scheduled in different cloud availability zones to tolerate the outage of a whole data center. [TiDB Operator](https://docs.pingcap.com/tidb-in-kubernetes/stable/tidb-operator-overview) helps manage TiDB on Kubernetes and automates tasks related to operating the TiDB cluster, making TiDB easier to deploy on any cloud that provides managed Kubernetes. [TiDB Cloud](https://pingcap.com/tidb-cloud/), the fully-managed TiDB service, is the easiest, most economical, and most resilient way to unlock the full power of [TiDB in the cloud](https://docs.pingcap.com/tidbcloud/), allowing you to deploy and run TiDB clusters with just a few clicks.\\n\\n- **Compatible with the MySQL protocol and MySQL ecosystem**\\n\\n  TiDB is compatible with the MySQL protocol, common features of MySQL, and the MySQL ecosystem. To migrate applications to TiDB, you do not need to change a single line of code in many cases, or only need to modify a small amount of code.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0196384b-e01b-7eed-a470-c9bcb5a7eccc</td>\n",
       "      <td>In addition, TiDB provides a series of [data migration tools](/ecosystem-tool-user-guide.md) to help easily migrate application data into TiDB.\\n\\n## See also\\n\\n- [TiDB Architecture](/tidb-architecture.md)\\n- [TiDB Storage](/tidb-storage.md)\\n- [TiDB Computing](/tidb-computing.md)\\n- [TiDB Scheduling](/tidb-scheduling.md)</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                     id  \\\n",
       "0  0196384b-e01b-7e87-8ae3-ceaebc5ea4f0   \n",
       "1  0196384b-e01b-7ebc-9a82-f51dac13ba5c   \n",
       "2  0196384b-e01b-7ed4-bf69-f65c861aedf7   \n",
       "3  0196384b-e01b-7ee1-91f0-d57434e5d74d   \n",
       "4  0196384b-e01b-7eed-a470-c9bcb5a7eccc   \n",
       "\n",
       "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        text  \n",
       "0                                                                                          ---\\ntitle: What is TiDB Self-Managed\\nsummary: Learn about the key features and usage scenarios of TiDB.\\naliases: ['/docs/dev/key-features/','/tidb/dev/key-features','/docs/dev/overview/']\\n---\\n\\n# What is TiDB Self-Managed\\n\\n<!-- Localization note for TiDB:\\n\\n- English: use distributed SQL, and start to emphasize HTAP\\n- Chinese: can keep \"NewSQL\" and emphasize one-stop real-time HTAP (\"一栈式实时 HTAP\")\\n- Japanese: use NewSQL because it is well-recognized\\n\\n-->\\n\\n[TiDB](https://github.com/pingcap/tidb) (/'taɪdiːbi:/, \"Ti\" stands for Titanium) is an open-source distributed SQL database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads. It is MySQL compatible and features horizontal scalability, strong consistency, and high availability. The goal of TiDB is to provide users with a one-stop database solution that covers OLTP (Online Transactional Processing), OLAP (Online Analytical Processing), and HTAP services. TiDB is suitable for various use cases that require high availability and strong consistency with large-scale data.  \n",
       "1  TiDB is suitable for various use cases that require high availability and strong consistency with large-scale data.\\n\\nTiDB Self-Managed is a product option of TiDB, where users or organizations can deploy and manage TiDB on their own infrastructure with complete flexibility. With TiDB Self-Managed, you can enjoy the power of open source, distributed SQL while retaining full control over your environment.\\n\\nThe following video introduces key features of TiDB.\\n\\n<iframe width=\"600\" height=\"450\" src=\"https://www.youtube.com/embed/aWBNNPm21zg?enablejsapi=1\" title=\"Why TiDB?\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture\" allowfullscreen></iframe>\\n\\n## Key features\\n\\n- **Easy horizontal scaling**\\n\\n  The TiDB architecture design separates computing from storage, letting you scale out or scale in the computing or storage capacity online as needed. The scaling process is transparent to application operations and maintenance staff.\\n\\n- **Financial-grade high availability**\\n\\n  Data is stored in multiple replicas, and the Multi-Raft protocol is used to obtain the transaction log.  \n",
       "2                                                                       A transaction can only be committed when data has been successfully written into the majority of replicas. This guarantees strong consistency and availability when a minority of replicas go down. You can configure the geographic location and number of replicas as needed to meet different disaster tolerance levels.\\n\\n- **Real-time HTAP**\\n\\n  TiDB provides two storage engines: [TiKV](/tikv-overview.md), a row-based storage engine, and [TiFlash](/tiflash/tiflash-overview.md), a columnar storage engine. \\n\\n  TiFlash uses the Multi-Raft Learner protocol to replicate data from TiKV in real time, ensuring consistent data between the TiKV row-based storage engine and the TiFlash columnar storage engine. TiKV and TiFlash can be deployed on different machines as needed to solve the problem of HTAP resource isolation.\\n\\n- **Cloud-native distributed database**\\n\\n  TiDB is a distributed database designed for the cloud, providing flexible scalability, reliability, and security on the cloud platform. Users can elastically scale TiDB to meet the requirements of their changing workloads.  \n",
       "3                                                             Users can elastically scale TiDB to meet the requirements of their changing workloads. In TiDB, each piece of data has at least 3 replicas, which can be scheduled in different cloud availability zones to tolerate the outage of a whole data center. [TiDB Operator](https://docs.pingcap.com/tidb-in-kubernetes/stable/tidb-operator-overview) helps manage TiDB on Kubernetes and automates tasks related to operating the TiDB cluster, making TiDB easier to deploy on any cloud that provides managed Kubernetes. [TiDB Cloud](https://pingcap.com/tidb-cloud/), the fully-managed TiDB service, is the easiest, most economical, and most resilient way to unlock the full power of [TiDB in the cloud](https://docs.pingcap.com/tidbcloud/), allowing you to deploy and run TiDB clusters with just a few clicks.\\n\\n- **Compatible with the MySQL protocol and MySQL ecosystem**\\n\\n  TiDB is compatible with the MySQL protocol, common features of MySQL, and the MySQL ecosystem. To migrate applications to TiDB, you do not need to change a single line of code in many cases, or only need to modify a small amount of code.  \n",
       "4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       In addition, TiDB provides a series of [data migration tools](/ecosystem-tool-user-guide.md) to help easily migrate application data into TiDB.\\n\\n## See also\\n\\n- [TiDB Architecture](/tidb-architecture.md)\\n- [TiDB Storage](/tidb-storage.md)\\n- [TiDB Computing](/tidb-computing.md)\\n- [TiDB Scheduling](/tidb-scheduling.md)  "
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "docs = kb.add(\"./fixtures/tidb-overview.md\", chunker=text_chunker)\n",
    "\n",
    "DataFrame(\n",
    "    [(c.id, c.text) for c in docs[0].chunks],\n",
    "    columns=[\"id\", \"text\"],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "84fd9b606e6a11a5",
   "metadata": {},
   "source": [
    "### Search Documents"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "259ad7a9",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-04-15T01:31:57.567655Z",
     "start_time": "2025-04-15T01:31:57.543046Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>text</th>\n",
       "      <th>score</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>---\\ntitle: What is TiDB Self-Managed\\nsummary: Learn about the key features and usage scenarios of TiDB.\\naliases: ['/docs/dev/key-features/','/tidb/dev/key-features','/docs/dev/overview/']\\n---\\n\\n# What is TiDB Self-Managed\\n\\n&lt;!-- Localization note for TiDB:\\n\\n- English: use distributed SQL, and start to emphasize HTAP\\n- Chinese: can keep \"NewSQL\" and emphasize one-stop real-time HTAP (\"一栈式实时 HTAP\")\\n- Japanese: use NewSQL because it is well-recognized\\n\\n--&gt;\\n\\n[TiDB](https://github.com/pingcap/tidb) (/'taɪdiːbi:/, \"Ti\" stands for Titanium) is an open-source distributed SQL database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads. It is MySQL compatible and features horizontal scalability, strong consistency, and high availability. The goal of TiDB is to provide users with a one-stop database solution that covers OLTP (Online Transactional Processing), OLAP (Online Analytical Processing), and HTAP services. TiDB is suitable for various use cases that require high availability and strong consistency with large-scale data.</td>\n",
       "      <td>0.726047</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>TiDB is suitable for various use cases that require high availability and strong consistency with large-scale data.\\n\\nTiDB Self-Managed is a product option of TiDB, where users or organizations can deploy and manage TiDB on their own infrastructure with complete flexibility. With TiDB Self-Managed, you can enjoy the power of open source, distributed SQL while retaining full control over your environment.\\n\\nThe following video introduces key features of TiDB.\\n\\n&lt;iframe width=\"600\" height=\"450\" src=\"https://www.youtube.com/embed/aWBNNPm21zg?enablejsapi=1\" title=\"Why TiDB?\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture\" allowfullscreen&gt;&lt;/iframe&gt;\\n\\n## Key features\\n\\n- **Easy horizontal scaling**\\n\\n  The TiDB architecture design separates computing from storage, letting you scale out or scale in the computing or storage capacity online as needed. The scaling process is transparent to application operations and maintenance staff.\\n\\n- **Financial-grade high availability**\\n\\n  Data is stored in multiple replicas, and the Multi-Raft protocol is used to obtain the transaction log.</td>\n",
       "      <td>0.669803</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Users can elastically scale TiDB to meet the requirements of their changing workloads. In TiDB, each piece of data has at least 3 replicas, which can be scheduled in different cloud availability zones to tolerate the outage of a whole data center. [TiDB Operator](https://docs.pingcap.com/tidb-in-kubernetes/stable/tidb-operator-overview) helps manage TiDB on Kubernetes and automates tasks related to operating the TiDB cluster, making TiDB easier to deploy on any cloud that provides managed Kubernetes. [TiDB Cloud](https://pingcap.com/tidb-cloud/), the fully-managed TiDB service, is the easiest, most economical, and most resilient way to unlock the full power of [TiDB in the cloud](https://docs.pingcap.com/tidbcloud/), allowing you to deploy and run TiDB clusters with just a few clicks.\\n\\n- **Compatible with the MySQL protocol and MySQL ecosystem**\\n\\n  TiDB is compatible with the MySQL protocol, common features of MySQL, and the MySQL ecosystem. To migrate applications to TiDB, you do not need to change a single line of code in many cases, or only need to modify a small amount of code.</td>\n",
       "      <td>0.656657</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        text  \\\n",
       "0                                                                                          ---\\ntitle: What is TiDB Self-Managed\\nsummary: Learn about the key features and usage scenarios of TiDB.\\naliases: ['/docs/dev/key-features/','/tidb/dev/key-features','/docs/dev/overview/']\\n---\\n\\n# What is TiDB Self-Managed\\n\\n<!-- Localization note for TiDB:\\n\\n- English: use distributed SQL, and start to emphasize HTAP\\n- Chinese: can keep \"NewSQL\" and emphasize one-stop real-time HTAP (\"一栈式实时 HTAP\")\\n- Japanese: use NewSQL because it is well-recognized\\n\\n-->\\n\\n[TiDB](https://github.com/pingcap/tidb) (/'taɪdiːbi:/, \"Ti\" stands for Titanium) is an open-source distributed SQL database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads. It is MySQL compatible and features horizontal scalability, strong consistency, and high availability. The goal of TiDB is to provide users with a one-stop database solution that covers OLTP (Online Transactional Processing), OLAP (Online Analytical Processing), and HTAP services. TiDB is suitable for various use cases that require high availability and strong consistency with large-scale data.   \n",
       "1  TiDB is suitable for various use cases that require high availability and strong consistency with large-scale data.\\n\\nTiDB Self-Managed is a product option of TiDB, where users or organizations can deploy and manage TiDB on their own infrastructure with complete flexibility. With TiDB Self-Managed, you can enjoy the power of open source, distributed SQL while retaining full control over your environment.\\n\\nThe following video introduces key features of TiDB.\\n\\n<iframe width=\"600\" height=\"450\" src=\"https://www.youtube.com/embed/aWBNNPm21zg?enablejsapi=1\" title=\"Why TiDB?\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture\" allowfullscreen></iframe>\\n\\n## Key features\\n\\n- **Easy horizontal scaling**\\n\\n  The TiDB architecture design separates computing from storage, letting you scale out or scale in the computing or storage capacity online as needed. The scaling process is transparent to application operations and maintenance staff.\\n\\n- **Financial-grade high availability**\\n\\n  Data is stored in multiple replicas, and the Multi-Raft protocol is used to obtain the transaction log.   \n",
       "2                                                             Users can elastically scale TiDB to meet the requirements of their changing workloads. In TiDB, each piece of data has at least 3 replicas, which can be scheduled in different cloud availability zones to tolerate the outage of a whole data center. [TiDB Operator](https://docs.pingcap.com/tidb-in-kubernetes/stable/tidb-operator-overview) helps manage TiDB on Kubernetes and automates tasks related to operating the TiDB cluster, making TiDB easier to deploy on any cloud that provides managed Kubernetes. [TiDB Cloud](https://pingcap.com/tidb-cloud/), the fully-managed TiDB service, is the easiest, most economical, and most resilient way to unlock the full power of [TiDB in the cloud](https://docs.pingcap.com/tidbcloud/), allowing you to deploy and run TiDB clusters with just a few clicks.\\n\\n- **Compatible with the MySQL protocol and MySQL ecosystem**\\n\\n  TiDB is compatible with the MySQL protocol, common features of MySQL, and the MySQL ecosystem. To migrate applications to TiDB, you do not need to change a single line of code in many cases, or only need to modify a small amount of code.   \n",
       "\n",
       "      score  \n",
       "0  0.726047  \n",
       "1  0.669803  \n",
       "2  0.656657  "
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result = kb.search_documents(\n",
    "    query=\"What is TiDB?\",\n",
    "    top_k=3,\n",
    ")\n",
    "\n",
    "DataFrame(\n",
    "    [(c.text, c.score) for c in result.chunks],\n",
    "    columns=[\"text\", \"score\"],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f2a0de8057cdf16b",
   "metadata": {},
   "source": [
    "### Search Knowledge Graph"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "6fc5bc93",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-04-15T01:31:57.746250Z",
     "start_time": "2025-04-15T01:31:57.605589Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>source_entity</th>\n",
       "      <th>relation</th>\n",
       "      <th>target_entity</th>\n",
       "      <th>score</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>TiDB</td>\n",
       "      <td>TiDB Storage is an essential part of how TiDB manages data.</td>\n",
       "      <td>TiDB Storage</td>\n",
       "      <td>6.546173</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>TiDB</td>\n",
       "      <td>TiDB provides TiKV as a row-based storage engine for data storage.</td>\n",
       "      <td>TiKV</td>\n",
       "      <td>6.256637</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>TiDB</td>\n",
       "      <td>TiDB Computing describes the processing capabilities of the TiDB database.</td>\n",
       "      <td>TiDB Computing</td>\n",
       "      <td>5.975210</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>TiDB</td>\n",
       "      <td>TiDB has key features that include easy horizontal scaling and financial-grade high availability.</td>\n",
       "      <td>Key features of TiDB</td>\n",
       "      <td>5.648048</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>TiDB</td>\n",
       "      <td>TiDB provides strong consistency, ensuring that all transactions are immediately visible to users.</td>\n",
       "      <td>Strong Consistency</td>\n",
       "      <td>5.378570</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>TiDB</td>\n",
       "      <td>TiDB Architecture is a key component of the TiDB database system.</td>\n",
       "      <td>TiDB Architecture</td>\n",
       "      <td>5.374958</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>TiDB</td>\n",
       "      <td>TiDB is designed for high availability, ensuring operational continuity even during failures.</td>\n",
       "      <td>High Availability</td>\n",
       "      <td>5.220304</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>TiDB</td>\n",
       "      <td>TiDB is MySQL compatible, enabling users to utilize existing MySQL applications with minimal adjustments.</td>\n",
       "      <td>MySQL Compatibility</td>\n",
       "      <td>5.137373</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>TiDB</td>\n",
       "      <td>TiDB offers OLAP services, enabling fast and interactive access to data for analytical purposes.</td>\n",
       "      <td>OLAP (Online Analytical Processing)</td>\n",
       "      <td>5.021178</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>TiDB</td>\n",
       "      <td>TiDB provides a series of data migration tools to help easily migrate application data into TiDB.</td>\n",
       "      <td>data migration tools</td>\n",
       "      <td>5.002972</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>TiDB</td>\n",
       "      <td>TiDB provides TiFlash as a columnar storage engine that replicates data from TiKV.</td>\n",
       "      <td>TiFlash</td>\n",
       "      <td>4.756693</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>TiDB</td>\n",
       "      <td>TiDB features horizontal scalability, allowing it to expand capacity by adding more machines to the cluster.</td>\n",
       "      <td>Horizontal Scalability</td>\n",
       "      <td>4.715631</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>TiDB</td>\n",
       "      <td>TiDB provides OLTP services, facilitating transaction-oriented applications for data entry and retrieval.</td>\n",
       "      <td>OLTP (Online Transactional Processing)</td>\n",
       "      <td>4.683033</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>TiDB</td>\n",
       "      <td>TiDB supports Hybrid Transactional and Analytical Processing (HTAP) workloads, allowing for simultaneous handling of transactional and analytical tasks.</td>\n",
       "      <td>Hybrid Transactional and Analytical Processing (HTAP)</td>\n",
       "      <td>4.431353</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>TiDB</td>\n",
       "      <td>TiDB Self-Managed is a product option of TiDB that provides users with the ability to deploy and manage TiDB on their own infrastructure.</td>\n",
       "      <td>TiDB Self-Managed</td>\n",
       "      <td>4.256526</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>TiDB</td>\n",
       "      <td>TiDB Scheduling is involved in managing the execution of tasks within the TiDB database.</td>\n",
       "      <td>TiDB Scheduling</td>\n",
       "      <td>4.245480</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>TiDB</td>\n",
       "      <td>TiDB is designed as a cloud-native distributed database providing flexible scalability and reliability.</td>\n",
       "      <td>Cloud-native</td>\n",
       "      <td>4.219989</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>TiDB</td>\n",
       "      <td>TiDB uses the Multi-Raft protocol to ensure high availability by managing transaction logs across multiple replicas.</td>\n",
       "      <td>Multi-Raft protocol</td>\n",
       "      <td>3.848345</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>Key features of TiDB</td>\n",
       "      <td>Another key feature of TiDB is financial-grade high availability, which is achieved through data replication.</td>\n",
       "      <td>Financial-grade high availability</td>\n",
       "      <td>3.475075</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>Key features of TiDB</td>\n",
       "      <td>One of the key features of TiDB is easy horizontal scaling, which allows for flexible resource management.</td>\n",
       "      <td>Easy horizontal scaling</td>\n",
       "      <td>3.398429</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "           source_entity  \\\n",
       "0                   TiDB   \n",
       "1                   TiDB   \n",
       "2                   TiDB   \n",
       "3                   TiDB   \n",
       "4                   TiDB   \n",
       "5                   TiDB   \n",
       "6                   TiDB   \n",
       "7                   TiDB   \n",
       "8                   TiDB   \n",
       "9                   TiDB   \n",
       "10                  TiDB   \n",
       "11                  TiDB   \n",
       "12                  TiDB   \n",
       "13                  TiDB   \n",
       "14                  TiDB   \n",
       "15                  TiDB   \n",
       "16                  TiDB   \n",
       "17                  TiDB   \n",
       "18  Key features of TiDB   \n",
       "19  Key features of TiDB   \n",
       "\n",
       "                                                                                                                                                    relation  \\\n",
       "0                                                                                                TiDB Storage is an essential part of how TiDB manages data.   \n",
       "1                                                                                         TiDB provides TiKV as a row-based storage engine for data storage.   \n",
       "2                                                                                 TiDB Computing describes the processing capabilities of the TiDB database.   \n",
       "3                                                          TiDB has key features that include easy horizontal scaling and financial-grade high availability.   \n",
       "4                                                         TiDB provides strong consistency, ensuring that all transactions are immediately visible to users.   \n",
       "5                                                                                          TiDB Architecture is a key component of the TiDB database system.   \n",
       "6                                                              TiDB is designed for high availability, ensuring operational continuity even during failures.   \n",
       "7                                                  TiDB is MySQL compatible, enabling users to utilize existing MySQL applications with minimal adjustments.   \n",
       "8                                                           TiDB offers OLAP services, enabling fast and interactive access to data for analytical purposes.   \n",
       "9                                                          TiDB provides a series of data migration tools to help easily migrate application data into TiDB.   \n",
       "10                                                                        TiDB provides TiFlash as a columnar storage engine that replicates data from TiKV.   \n",
       "11                                              TiDB features horizontal scalability, allowing it to expand capacity by adding more machines to the cluster.   \n",
       "12                                                 TiDB provides OLTP services, facilitating transaction-oriented applications for data entry and retrieval.   \n",
       "13  TiDB supports Hybrid Transactional and Analytical Processing (HTAP) workloads, allowing for simultaneous handling of transactional and analytical tasks.   \n",
       "14                 TiDB Self-Managed is a product option of TiDB that provides users with the ability to deploy and manage TiDB on their own infrastructure.   \n",
       "15                                                                  TiDB Scheduling is involved in managing the execution of tasks within the TiDB database.   \n",
       "16                                                   TiDB is designed as a cloud-native distributed database providing flexible scalability and reliability.   \n",
       "17                                      TiDB uses the Multi-Raft protocol to ensure high availability by managing transaction logs across multiple replicas.   \n",
       "18                                             Another key feature of TiDB is financial-grade high availability, which is achieved through data replication.   \n",
       "19                                                One of the key features of TiDB is easy horizontal scaling, which allows for flexible resource management.   \n",
       "\n",
       "                                            target_entity     score  \n",
       "0                                            TiDB Storage  6.546173  \n",
       "1                                                    TiKV  6.256637  \n",
       "2                                          TiDB Computing  5.975210  \n",
       "3                                    Key features of TiDB  5.648048  \n",
       "4                                      Strong Consistency  5.378570  \n",
       "5                                       TiDB Architecture  5.374958  \n",
       "6                                       High Availability  5.220304  \n",
       "7                                     MySQL Compatibility  5.137373  \n",
       "8                     OLAP (Online Analytical Processing)  5.021178  \n",
       "9                                    data migration tools  5.002972  \n",
       "10                                                TiFlash  4.756693  \n",
       "11                                 Horizontal Scalability  4.715631  \n",
       "12                 OLTP (Online Transactional Processing)  4.683033  \n",
       "13  Hybrid Transactional and Analytical Processing (HTAP)  4.431353  \n",
       "14                                      TiDB Self-Managed  4.256526  \n",
       "15                                        TiDB Scheduling  4.245480  \n",
       "16                                           Cloud-native  4.219989  \n",
       "17                                    Multi-Raft protocol  3.848345  \n",
       "18                      Financial-grade high availability  3.475075  \n",
       "19                                Easy horizontal scaling  3.398429  "
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "kg = kb.search_knowledge_graph(\n",
    "    query=\"What is TiDB?\",\n",
    ")\n",
    "\n",
    "# Notice: score is the result of a weighted formula\n",
    "\n",
    "DataFrame(\n",
    "    [\n",
    "        (r.source_entity.name, r.description, r.target_entity.name, r.score)\n",
    "        for r in kg.relationships\n",
    "    ],\n",
    "    columns=[\"source_entity\", \"relation\", \"target_entity\", \"score\"],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c1f1920c",
   "metadata": {},
   "source": [
    "### Ask question"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "54bab89a",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "TiDB is an open-source distributed SQL database designed to support Hybrid Transactional and Analytical Processing (HTAP) workloads. It is compatible with MySQL, allowing users to leverage existing MySQL applications and tools with minimal changes. TiDB features several key attributes:\n",
       "\n",
       "1. **High Availability**: TiDB is designed to ensure operational continuity even during failures, providing financial-grade high availability by storing data in multiple replicas.\n",
       "\n",
       "2. **Strong Consistency**: It guarantees that all transactions are immediately visible to all users, ensuring a reliable and predictable database experience.\n",
       "\n",
       "3. **Horizontal Scalability**: TiDB allows for easy horizontal scaling by separating computing from storage, enabling users to scale out or scale in their computing or storage capacity online as needed.\n",
       "\n",
       "4. **Support for OLTP and OLAP**: TiDB provides a one-stop database solution that covers Online Transactional Processing (OLTP), Online Analytical Processing (OLAP), and HTAP services, making it suitable for various use cases that require high availability and strong consistency with large-scale data.\n",
       "\n",
       "5. **Cloud-native Architecture**: TiDB is designed for cloud environments, offering flexible scalability, reliability, and security on cloud platforms.\n",
       "\n",
       "6. **Data Migration Tools**: TiDB includes a series of data migration tools to facilitate the easy transfer of application data into the TiDB database.\n",
       "\n",
       "7. **Storage Engines**: TiDB utilizes two storage engines: TiKV, a row-based storage engine, and TiFlash, a columnar storage engine that replicates data from TiKV in real time.\n",
       "\n",
       "Overall, TiDB aims to provide users with a robust and flexible database solution that can adapt to changing workloads and requirements."
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from IPython.display import Markdown\n",
    "\n",
    "res = kb.ask(\"What is TiDB?\")\n",
    "Markdown(res.message.content)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3ed0149fb5a9e1cb",
   "metadata": {},
   "source": [
    "### Reset the KnowledgeBase"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "4303dc61b3f073f1",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-04-15T01:31:57.880832Z",
     "start_time": "2025-04-15T01:31:57.878931Z"
    }
   },
   "outputs": [],
   "source": [
    "# kb.reset()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
